Abstract. Without prior knowledge, distinguishing different languages may be a hard task, especially when their borders are permeable.

We develop an extension of spectral clustering — a powerful unsu- 
pervised classification toolbox — that is shown to resolve accurately 
the task of soft language distinction. At the heart of our approach, we 
replace the usual hard membership assignment of spectral clustering 
by a soft, probabilistic assignment, which has the further advantage of bypassing a well-known complexity bottleneck of the method.
Furthermore, our approach relies on a novel, convenient construc- 
tion of a Markov chain out of a corpus. Extensive experiments with a 
readily available system clearly display the potential of the method, 
which brings a visually appealing soft distinction of languages that 
may define altogether a whole corpus. 

1 Introduction 

This paper is concerned with unsupervised learning, the task that 
consists in assigning a set of objects to a set of q > 1 so-called 
clusters. For the purpose of text classification, we suppose that ob- 
jects are words: each cluster should define a set of words which are 
syntactically close to one another, while different clusters should be 
as different as possible from the syntactic standpoint. 

There are two main ways of understanding what is meant by "syntactically close": following [7], it is generally acknowledged that
two combined "axes" define the combinatorial possibilities of a lan- 
guage's syntax: the syntagmatic axis is the linear dimension of the 
text, where occurrences of words actually appear one after another; 
the paradigmatic axis is the dimension of all possible alternative 
choices available at a given position to a speaker or writer. Hence, 
two words are "syntactically close" to one another on the syntag- 
matic axis, if they often tend to appear together, at specific relative 
positions, in common contexts; they are close to one another on the 
paradigmatic axis, if they appear alternatively in similar positions 
within comparable contexts. The first criterion defines a measure of 
word distance within a text, and is suited to studying problems such 
as internal coherence of text segments [21]; the second defines a measure of word distance within a class, and is suited to studying problems like defining relevant syntactic [19] or semantic [17, 6] categories. In this paper, attention is given to the first of these problems, which has not been extensively tackled.



A challenging application in the field of linguistic engineering is 
language identification and comparison. Language identification in itself is now considered an easy task on monolingual text documents, as two very reliable methods (based on frequency analyses of the most frequent words, and of the most frequent n-byte sequences) may be combined to get optimal results [10]; yet some work remains to be done for the task of language identification on multilingual documents, where a non-trivial question is the definition of language section boundaries [22]. This question is particularly interesting when
the border between different languages is permeable. This is typi- 
cally the case within the group of Creole languages of the Caribbean 
region. 

Creole languages in their present form have emerged during a 
short period of time (probably less than one century, in the late 17th 
century), in very atypical conditions of language transmission and 
evolution. They have developed in the newly colonized West Atlantic 
territories (in the Caribbean islands and on both American main- 
lands), on the basis of Western European languages spread by the 
nations most involved in colonization (French, English, Portuguese 
and Dutch), but in sociolinguistic situations where, due to the rapidly 
growing slave trade economy, there could be, within every single 
generation, less than 50% of native speakers of the language in its 
current state of development involved in the speaking community. 
Even in periods of fast language evolution (as in the case of Middle English between the 11th and 15th centuries), no European language has experienced such a phase of "linguistic stress". After the 18th century, the language situation somewhat stabilized, although Creoles still undergo linguistic change at a pace which is probably faster than that of many well-established languages.

In at least some cases, the Creole language has remained in con- 
tact with its "lexifier" European language (none of those has in the 
meantime become extinct), in sociolinguistic situations which have 
sometimes been coined as "diglossic": this has especially been the 
case for English-based Creoles like Jamaican or Gullah (spoken in 
the USA states of South Carolina and Georgia); and, closer to our 
study's main focus, for French-derived Creoles spoken on the ter- 
ritories of Haiti, Guadeloupe, Martinique and French Guiana. In a 
diglossic situation, the European language is still in use as the official 
and prestige language, while the Creole language is the vernacular. 
This leads to very frequent code-switching and to intermingling of languages in several domains. Thus, when it comes to corpora of linguistic productions coming from this type of speech community, the question of the "border" between languages can be asked on two distinct planes: on the plane of structural (merely linguistic) properties, and on the plane of the situations of use.

The first question involves problems of language clustering. A 
learning task might consist in drawing a cladogram (family tree) of 
various French-based Creole languages on the basis of their structural similarities. Studying "paradigmatic" syntactic closeness (i.e. context similarity, see above) might also help define the most appropriate part-of-speech categorization for those languages, and check the appropriateness of Eurocentric grammatical descriptions in their case. But this is not the main scope of the present paper.

The second question involves delimiting the use of every language 
in multilingual texts or speech productions, and this is the task on 
which we will now concentrate. 

In the last few years, the most prominent developments of text classification have concerned supervised classification (i.e. texts have explicit labels to predict), with the advent of algorithms powerful enough to process texts described with the simplest conventions (e.g. attribute-value vectors) [11, 12, 18]. A glimpse at the unsupervised side easily reveals that unsupervised classification has so far remained comparatively distant from text classification, at least for its most recent breakthroughs in learning / mining. Spectral clustering is a
very good example, with such a success that its recent develop- 
ments have been qualified elsewhere as a "gold rush" in classifica- 
tion O [T] O EJ [13] Q31 [T5J (and many others), pioneered by works 
in spectral graph theory [5] and image segmentation [20]. Roughly
speaking, spectral clustering consists in finding some principal axes 
of a similarity matrix. The subspace they span, onto which the data 
are projected, may yield clusters optimizing a criterion that takes into 
account both the maximization of the within-cluster similarity, and 
the minimization of the between-clusters similarity. Among the at- 
tempts to cast spectral clustering to text classification, one of the first 
builds the similarity matrix via the computation of cosines between 
vector-based representations of words, and then builds a normalized 
graph Laplacian out of this matrix to find out the principal axes 13. 

The papers that have so far investigated spectral clustering have two points in common. First, they consider a hard membership assignment of data: the clusters induce a partition of the set of objects. It is widely known that soft membership, which assigns a fraction of each object to each cluster, is sometimes preferable, either to improve the solution or because the problem at hand requires it. This is clearly our case, as words may belong to more than one language cluster. In fact, this is also the case for the probabilistic (density estimation) approaches to clustering, pioneered by the popular Expectation Maximization [8]. Their second common point is linked to the first: the clustering solution is obtained after thresholding the spectral clustering output. This is crucial because in most (if not all) cases, the optimization of the clustering quality criterion is NP-Hard for the hard membership assignment [20]. To be more precise, the principal axes yield the polynomial time optimal solution to an optimization problem whose criterion is the same as that of hard membership (modulo a constant factor), but whose domain is unconstrained. Hard membership makes it necessary to fit (threshold) this optimal solution to a constrained domain. Little is currently known about the quality of this approximation, except for the NP-Hardness of the task.

This paper, which also focuses on spectral clustering, departs from the mainstream for the following reasons and contributions. First (Section 2), compared to text classification approaches, we do not build the similarity matrix in an ad hoc manner like [2]. Rather, we consider that the corpus is generated by a stochastic process following a popular bigram model [16], out of which we build its maximum likelihood Markov chain. This particular Markov chain satisfies all conditions for a convenient spectral decomposition. Second (Section 3), we propose an extension of spectral clustering to soft spectral clustering, for which we give a probabilistic interpretation of the spectral clustering output. Apart from our task at hand, which justifies this extension, we feel that such results may be of independent interest, because they tackle the interpretation of the tractable part of spectral clustering, avoiding the complexity gap that follows after hard membership. Last (Section 4), we provide experimental results of soft spectral clustering on a readily available system; experiments clearly display the potential of this method for text classification.

2 Maximum likelihood Markov chains 

In this paper, calligraphic faces such as $\mathcal{X}$ denote sets and blackboard faces such as $\mathbb{S}$ denote subsets of $\mathbb{R}$, the set of real numbers; whenever applicable, indexed lower cases such as $x_i$ ($i = 1, 2, \ldots$) enumerate the elements of $\mathcal{X}$. Upper cases like $M$ denote matrices, with $m_{i,j}$ being the entry in row $i$, column $j$ of $M$; $M^\top$ is the transpose of $M$. Boldfaces such as $\mathbf{x}$ denote column vectors, with $x_i$ being the $i$-th element of $\mathbf{x}$. A corpus $\mathcal{C}$ is a set of texts, $\{T_1, T_2, \ldots, T_m\}$, with $m$ the length of the corpus. $\forall 1 \le k \le m$, text $T_k$ is a string of tokens (words or punctuation marks), $T_k = \omega_{k,1}\omega_{k,2}\cdots\omega_{k,|T_k|}$, of size $|T_k|$, with $|.|$ the cardinal (whole number of tokens of $T_k$). The size of the corpus, $|\mathcal{C}| = n$, is the sum of the lengths of the texts: $n = \sum_{k=1}^{m} |T_k|$. The size of a corpus is implicitly measured in words, but it may contain punctuation marks as well. The vocabulary of $\mathcal{C}$, $\mathcal{V}$, is the set of distinct linguistically relevant words or punctuation marks, the tokens of which are contained in the texts of $\mathcal{C}$. The size of the vocabulary is denoted $v = |\mathcal{V}|$. The elements of $\mathcal{V} = \{v_1, v_2, \ldots, v_v\}$ are types: each one is unique and appears only once in $\mathcal{V}$. $\forall i, j \in \{1, 2, \ldots, v\}$, we let $n_i$ denote the number of occurrences of type $i$ in $\mathcal{C}$, and $n_{i,j}$ the number of times a word of type $i$ immediately precedes (on the left) a word of type $j$ in $\mathcal{C}$. Finally, we denote by $\mathfrak{M}$ a (first order) Markov chain, with state space $\mathcal{V}$, and transition probability matrix $P_{v \times v}$. $P$ is row stochastic: $p_{i,j} \ge 0$ ($1 \le i, j \le v$) and $\sum_{j=1}^{v} p_{i,j} = 1$ ($1 \le i \le v$). Suppose that $\mathcal{C}$ is generated from $\mathfrak{M}$. The most natural way to build $P$ is to maximize its likelihood with respect to $\mathcal{C}$. The solution is given by the following folklore Lemma.

Lemma 1 The maximum likelihood transition matrix $P$ is defined by $p_{i,j} = n_{i,j}/n_i$, with $1 \le i, j \le v$.
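As an illustration of Lemma 1, the following minimal sketch counts left-to-right bigrams and normalizes each row. The function name `ml_transition_probabilities` and the toy corpus are hypothetical, and one assumption is made that the text does not state: $n_i$ is taken as the number of occurrences of type $i$ that have a right neighbour, so that every non-empty row sums to one.

```python
from collections import Counter, defaultdict

def ml_transition_probabilities(texts):
    """Return {i: {j: p_ij}} with p_ij = n_ij / n_i for a list of token lists.

    Assumption (not from the paper): n_i counts the occurrences of type i
    that are followed by another token, so each non-empty row sums to one.
    """
    pair_counts = Counter()   # n_ij: type i immediately precedes type j
    left_counts = Counter()   # n_i: occurrences of type i with a successor
    for tokens in texts:
        for left, right in zip(tokens, tokens[1:]):
            pair_counts[(left, right)] += 1
            left_counts[left] += 1
    P = defaultdict(dict)
    for (left, right), n_ij in pair_counts.items():
        P[left][right] = n_ij / left_counts[left]
    return dict(P)

# Toy corpus of two already tokenized texts (illustrative data only).
corpus = [["la", "vie", "est", "belle"], ["la", "vie", "la", "nuit"]]
P = ml_transition_probabilities(corpus)
```

In this sketch, types that only occur at the end of a text ("belle", "nuit") get no row at all, which is one symptom of the dependence on writing direction and text boundaries.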

The computation of $P$ as in Lemma 1 is convenient if we make the assumption that a text is written from left to right. This corresponds to an a priori intuition of speakers of European languages, who have been taught to read and write in languages where the graphical transcription of the linearity of speech is done from left to right. However, a more thorough reflection on the empirical nature of the problem has led us to question this approach. The method being developed should be able to work on any type of written language, making no assumption on its transcription conventions. Some languages (among which important literary languages like Hebrew or Arabic) have a tradition of writing from right to left, and this sometimes goes down to having the actual stream of bytes in the file also going "from right to left" (in the file access sense). The new Unicode standard for specifying language directionality circumvents this, by allowing the file to always be coded in the logical order, and managing the visual rendering so that it suits the language conventions, even in the case of mixed-language texts (e.g. English texts with Hebrew quotes); but large corpora are still encoded in the old way, and the program should not be sensitive to this. More generally, the method we propose should be designed to accept any file as a static, empirical object, and should be able to find laws and regularities in it, making no more postulates than necessary.

We have found a convenient approach to eliminate this directionality dependence. It also has the benefit of removing the dependence on the choice of the first word used to write down a text. Everything happens as if we were computing the likelihood of $\mathcal{C}$ with respect to the writing of its texts in a circular way. Figure 1 presents the writing of text $T_k$: we pick a random word, and then move either clockwise or counterclockwise to write words. After we have made a complete turn, everything happens as if we had written $T_k$ twice. The following Lemma, whose proof is direct from Lemma 1, gives the new maximum likelihood transition matrix $P$ (proof omitted to save space).

Figure 1. A "circular" generation of a text $T_k$ eliminates both the direction for writing $\mathcal{C}$ (arrows) and the choice of the first word written.

Lemma 2 With the circular writing approach, $P = D^{-1}W$, with $W_{v \times v}$ such that $w_{i,j} = (n_{i,j} + n_{j,i})/2$, and $D_{v \times v}$ diagonal with $d_{i,i} = d_i = n_i$.
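The construction of Lemma 2 can be sketched as follows, under the assumption that "circular writing" means the last token of each text is followed by its first token. The function name `circular_chain` and the toy text are hypothetical; the vocabulary is simply sorted to fix an index order.

```python
from collections import Counter

def circular_chain(texts):
    """Build W, d and P = D^{-1} W from circularly written token lists."""
    n = Counter()                                   # n[(a, b)]: a immediately precedes b
    for tokens in texts:
        for pos, left in enumerate(tokens):
            right = tokens[(pos + 1) % len(tokens)]  # wrap: last token precedes the first
            n[(left, right)] += 1
    vocab = sorted({t for tokens in texts for t in tokens})
    index = {t: i for i, t in enumerate(vocab)}
    v = len(vocab)
    W = [[0.0] * v for _ in range(v)]
    for (a, b), count in n.items():
        # Each directed count contributes half to both symmetric entries,
        # which yields w_ij = (n_ij + n_ji) / 2.
        W[index[a]][index[b]] += count / 2.0
        W[index[b]][index[a]] += count / 2.0
    d = [sum(row) for row in W]                     # row sums, equal to the counts n_i
    P = [[W[i][j] / d[i] for j in range(v)] for i in range(v)]
    return vocab, W, d, P

text = ["a", "b", "a", "c"]                          # illustrative data only
vocab, W, d, P = circular_chain([text])
_, W_reversed, _, _ = circular_chain([list(reversed(text))])
```

On this toy text, `W_reversed` equals `W`, which is the direction independence the circular writing is meant to provide; rotating the text (choosing another first word) leaves `W` unchanged as well.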

From now on, we use the expression for $P$ in Lemma 2. The circular way to write down the texts of $\mathcal{C}$ has another advantage: $\mathfrak{M}$ is irreducible. Let us make the assumption that $\mathfrak{M}$ is also aperiodic. This is a clearly mild assumption, which $\mathfrak{M}$ satisfies for a vocabulary large enough, as in this case loops of arbitrarily long size tend to appear between words. Irreducibility and aperiodicity imply that $\mathfrak{M}$ is ergodic, i.e. regardless of the initial distribution, $\mathfrak{M}$ will settle down over time to a single stationary distribution $\pi$, solution of $P^\top \pi = \pi$, with $\pi_i = n_i/n$ [13].
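The closed form of the stationary distribution can be checked in one line from Lemma 2. The short derivation below is a sketch that only uses the symmetry of $W$ and the fact that its row sums are $d_i = n_i$:

```latex
\left(P^\top \pi\right)_j
  = \sum_{i=1}^{v} p_{i,j}\,\pi_i
  = \sum_{i=1}^{v} \frac{w_{i,j}}{n_i}\cdot\frac{n_i}{n}
  = \frac{1}{n}\sum_{i=1}^{v} w_{j,i}
  = \frac{d_j}{n}
  = \frac{n_j}{n}
  = \pi_j .
```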

3 From hard to soft spectral clustering 

Fix $q > 1$, some user-fixed integer that represents the number of clusters to find. The ideal objective would be to find a mapping $Z : \mathcal{V} \to \mathbb{S}^q$, with $\mathbb{S} = \{0, 1\}$, a mapping that we represent by a matrix $Z = [\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_q] \in \mathbb{S}^{v \times q}$. Under appropriate constraints, the mapping should minimize a multiway normalized cuts (MNC) criterion [1, 2, 3, 13, 14, 15, 20]:

$$\arg\min_{Z \in \mathbb{S}^{v \times q}} \mu(Z) = \sum_{k=1}^{q} \kappa_k(Z)/\alpha_k(Z) , \qquad (1)$$

s.t. $Z^\top Z$ positive diagonal
s.t. $\mathrm{tr}(Z^\top Z) = v$ ,

with $\kappa_k(Z) = \sum_{i,j=1}^{v} w_{i,j}(z_{i,k} - z_{j,k})^2$ and $\alpha_k(Z) = \sum_{i=1}^{v} z_{i,k} d_i$. Since this does not change the value of $\mu(Z)$, we suppose without loss of generality that $w_{i,i} = 0$, $\forall 1 \le i \le v$. Because of the constraints on $Z$ in (1), it induces a natural hard membership assignment on $\mathcal{V}$ (i.e. a partition), as follows:

$$\mathcal{V}_k = \{v_i : z_{i,k} = 1\} , \forall 1 \le k \le q . \qquad (2)$$
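To make the criterion concrete, here is a small sketch that evaluates $\mu(Z)$ for a hard partition given as sets of indices. The function name `mnc` and the toy weight matrix (two triangles joined by one weak edge) are illustrative assumptions, not data from the paper.

```python
def mnc(W, clusters):
    """mu(Z) = sum_k kappa_k(Z) / alpha_k(Z) for a hard partition of the indices."""
    v = len(W)
    d = [sum(row) for row in W]
    total = 0.0
    for members in clusters:
        z = [1 if i in members else 0 for i in range(v)]   # column k of Z
        kappa = sum(W[i][j] * (z[i] - z[j]) ** 2
                    for i in range(v) for j in range(v))
        alpha = sum(z[i] * d[i] for i in range(v))
        total += kappa / alpha
    return total

# Two triangles {0, 1, 2} and {3, 4, 5}, linked by a weak edge between 2 and 3.
W = [[0.0] * 6 for _ in range(6)]
for a, b, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    W[a][b] = W[b][a] = w

mu_good = mnc(W, [{0, 1, 2}, {3, 4, 5}])   # cuts only the weak edge
mu_bad = mnc(W, [{0, 1, 3}, {2, 4, 5}])    # cuts through both triangles
```

Since $\kappa_k$ sums over ordered pairs, each cut edge is counted twice, so `mu_good` is twice the cut weight divided by the cluster volume, summed over both clusters.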

There is one appealing reason why clustering gets better as MNC in (1) is minimized. Suppose we start (at $t = 0$) a random walk with the Markov chain $\mathfrak{M}$, having transition matrix $P$, and from its stationary distribution $\pi$. Let $[\mathcal{V}_k]_t$ be the event that the Markov chain is in cluster $k$ at time $t > 1$. We obtain the following result [13]:

$$\mu(Z) = 2 \sum_{k=1}^{q} \Pr([\overline{\mathcal{V}}_k]_{t+1} \mid [\mathcal{V}_k]_t) \qquad (3)$$

for the partition defined in eq. (2), with $\overline{\mathcal{V}}_k = \mathcal{V} \setminus \mathcal{V}_k$. Thus, $\mu(Z)$ sums the probabilities of escaping a cluster given that the random walk is located inside the cluster: minimizing $\mu(Z)$ amounts to partitioning $\mathcal{V}$ into "stable" components with respect to $\mathfrak{M}$. Unfortunately, the minimization of MNC is NP-Hard, already when $q = 2$ [20]. To approximate this problem, the output is relaxed and the goal is rewritten as:

$$\arg\min_{Y} \nu(Y) = \sum_{k=1}^{q} \kappa_k(Y) , \qquad (4)$$

s.t. $Y^\top D Y = I$ .

This problem is tractable by a spectral decomposition of $\mathfrak{M}$ (see e.g. [20]), which yields that $Y$ is the set of the $q$ column eigenvectors associated to the smallest eigenvalues of the generalized eigenproblem ($\forall 1 \le k \le q$):

$$(D - W)\mathbf{y}_k = \lambda_k D \mathbf{y}_k , \qquad (5)$$

and it comes $\nu(Y) = 2 \sum_{k=1}^{q} \lambda_k$. If we suppose, without loss of generality, that eigenvalues are ordered, $\lambda_1 \le \lambda_2 \le \ldots \le \lambda_q$, then it easily comes $\lambda_1 = 0$, associated to a constant eigenvector $\mathbf{y}_1$ [20]. People usually discard this first eigenvector, and keep the following ones to compute $Z$ after a heuristic thresholding of $Y$. The proof that this thresholding is heuristic follows from the fact that if we restrict (4) to thresholded matrices (whose rows come from a set of at most $q$ distinct row vectors), then it becomes equivalent to (1), i.e. intractable.
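For small examples, the relaxed problem can be solved without a linear algebra library. The sketch below is not the paper's implementation: it swaps in power iteration with deflation on the symmetric matrix $S = (I + D^{-1/2} W D^{-1/2})/2$, whose eigenvalues are $1 - \lambda_k/2$, and maps the eigenvectors back through $\mathbf{y}_k = D^{-1/2}\mathbf{u}_k$ so that $Y^\top D Y = I$. The function name, the start vectors and the iteration count are arbitrary assumptions.

```python
import math

def smallest_generalized_eigenpairs(W, q, iters=2000):
    """Approximate the q smallest pairs (lambda_k, y_k) of (D - W) y = lambda D y."""
    v = len(W)
    d = [sum(row) for row in W]
    # S = (I + D^{-1/2} W D^{-1/2}) / 2 is symmetric with eigenvalues 1 - lambda/2 in [0, 1].
    S = [[(W[i][j] / math.sqrt(d[i] * d[j]) + (1.0 if i == j else 0.0)) / 2.0
          for j in range(v)] for i in range(v)]
    found, pairs = [], []
    for k in range(q):
        u = [1.0 / (i + 1 + k) for i in range(v)]      # arbitrary deterministic start
        for _ in range(iters):
            for f in found:                            # deflation: remove known directions
                c = sum(a * b for a, b in zip(u, f))
                u = [a - c * b for a, b in zip(u, f)]
            u = [sum(S[i][j] * u[j] for j in range(v)) for i in range(v)]
            norm = math.sqrt(sum(a * a for a in u))
            u = [a / norm for a in u]
        s_val = sum(u[i] * sum(S[i][j] * u[j] for j in range(v)) for i in range(v))
        lam = 2.0 * (1.0 - s_val)                      # back to the generalized eigenvalue
        y = [u[i] / math.sqrt(d[i]) for i in range(v)]  # y^T D y = u^T u = 1
        found.append(u)
        pairs.append((lam, y))
    return pairs

# Two triangles {0, 1, 2} and {3, 4, 5} joined by a weak edge (illustrative data).
W = [[0.0] * 6 for _ in range(6)]
for a, b, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    W[a][b] = W[b][a] = w
(lam1, y1), (lam2, y2) = smallest_generalized_eigenpairs(W, 2)
```

On this toy graph the first eigenvector is constant with a zero eigenvalue, and the second takes opposite signs on the two triangles. Power iteration converges slowly when eigenvalues are close, so a real corpus would call for a dedicated eigensolver.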

Notice however that the spectral relaxation finds the optimal solution to (4) in time $O(qv^3)$ (without algorithmic sophistication), from which the heuristic thresholding only aims at recovering a hard membership assignment. Whenever a soft membership assignment is preferable, we show that one can be obtained directly from $Y$, which is optimal with respect to a criterion similar to (3), while its computation bypasses the complexity bottleneck of hard membership, thus killing two birds with one stone.
For this purpose, define matrix $\tilde{Y}$ from $Y$ as:

$$\tilde{y}_{i,k} = d_i y_{i,k}^2 . \qquad (6)$$

Then, we have $\tilde{Y}^\top \mathbf{1} = \mathbf{1}$, i.e. each column vector $\tilde{\mathbf{y}}_k$ of $\tilde{Y}$ defines a probability distribution over $\mathcal{V}$. Since $\tilde{\mathbf{y}}_k$ is associated to principal axis $k$, it seems natural to define it as the probability to draw $v_i$ given that we are in $\mathcal{V}_k$, the cluster associated to the axis. Following the notations of eq. (3), we thus let:

$$\tilde{y}_{i,k} = \Pr([v_i]_t \mid [\mathcal{V}_k]_t) \qquad (7)$$

be the probability to pick type $v_i$, given that we are in cluster $k$, at time $t$. This is our soft membership assignment: axes define clusters, and the column vectors of $\tilde{Y}$ define the distributions associated to each cluster. Notice that this provides us with a sound extension of the hard MNC solution (1), for which $y_{i,k}$ equals 1 for a single cluster, and zero for the other clusters ($\forall 1 \le i \le v$). We also have more, as this brings a direct and non trivial generalization of (3). Define matrix $P^{(k)}$ such that $p^{(k)}_{i,j} = (w_{i,j} y_{j,k})/(d_i y_{i,k})$. $p^{(k)}_{i,j}$ is akin to the difference between the probabilities of reaching respectively $\mathcal{V}_k$ and $\overline{\mathcal{V}}_k$ in $j$, given that the random walk is located on $i$ ($\forall t \ge 0$): $p^{(k)}_{i,j} = \Pr([v_j \wedge \mathcal{V}_k]_{t+1} \mid [v_i]_t) - \Pr([v_j \wedge \overline{\mathcal{V}}_k]_{t+1} \mid [v_i]_t)$. Provided we make the assumption that reaching a type outside cluster $k$ at time $t + 1$ does not depend on the starting point at time $t$, an assumption similar to the memoryless property of Markov chains, we obtain our main result, whose proof relies on applications of Bayes rules.

Figure 2. Experiments on multilingual passages of Cinderella. Each row crops a borderline between two languages (from top to bottom): French / German, Spanish / English. The bottom row displays the quantities that are represented by RGB colors in each column, where each color level is associated to a principal axis $k \in \{2, 3, 4\}$.

Theorem 1 $\nu(Y) = 4 \sum_{k=1}^{q} \Pr([\overline{\mathcal{V}}_k]_{t+1} \mid [\mathcal{V}_k]_t)$.

In other words, solving (4) brings the soft clustering whose components have optimal stability, and whose associated distributions are given by $\tilde{Y}$. As a consequence, we easily obtain that $\tilde{\mathbf{y}}_1 = \pi$, the stationary distribution. This is natural, as this is the observed distribution of types, i.e. the one that best explains the data. In previous results, [2] choose the Brown corpus and make a 2D plot of some spectral clustering results on the second and third principal axes, after having made a prior selection of the most frequent words (to be plotted). From $\tilde{\mathbf{y}}_1$, it follows that this amounts to making a selection of words according to the first principal axis, which is not plotted.
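The soft membership of this section reduces to a few lines once the eigenvectors are known. The sketch below assumes the columns of `Y` are given as plain lists and rescales each so that $\mathbf{y}_k^\top D \mathbf{y}_k = 1$ before applying eq. (6). The function name and the three-type toy chain (degrees 2, 1, 1, with hand-computed eigenvectors) are illustrative assumptions.

```python
import math

def soft_membership(d, Y):
    """Return the matrix with entries d_i * y_ik^2, one distribution per column (eq. 6)."""
    v, q = len(d), len(Y[0])
    out = [[0.0] * q for _ in range(v)]
    for k in range(q):
        # Rescale column k so that y_k^T D y_k = 1, the constraint of problem (4).
        norm = math.sqrt(sum(d[i] * Y[i][k] ** 2 for i in range(v)))
        for i in range(v):
            out[i][k] = d[i] * (Y[i][k] / norm) ** 2
    return out

# Toy chain with W = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]: d = (2, 1, 1), n = 4.
d = [2.0, 1.0, 1.0]
# Columns: the constant eigenvector (lambda = 0) and (0, 1, -1) (lambda = 1),
# both checked by hand against (D - W) y = lambda D y.
Y = [[1.0, 0.0],
     [1.0, 1.0],
     [1.0, -1.0]]
Y_soft = soft_membership(d, Y)
pi = [d_i / sum(d) for d_i in d]
```

The first column of `Y_soft` coincides with `pi`, matching the remark that the first axis recovers the stationary distribution; the second column puts all its mass on the two types that the sign of $\mathbf{y}_2$ separates.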

4 Experiments 

A computer program has been developed to implement word classification and text segmenting according to the method explained above. It is publicly available through a CGI. The program takes a text of arbitrary size as input. First, it automatically detects the text format and encoding, and converts everything to raw text encoded in Unicode UTF-8. Second, it performs a stage of tokenization, i.e. it segments the raw stream of bytes into tokens of words, figures or typographical signs. Third, it builds an index table suited for fast access to word type information (designed on the lexical tree, or trie, model). Fourth, it computes the bigram transition matrix $T$ ($n_{i,j}$) (Lemma 1), by moving a contextual window along the tokens put in their text order, and incrementing $n_{i,j}$ for every seen occurrence of a transition $(\omega_i, \omega_j)$; $W$ and then $P$ (as given by Lemma 2) are then computed from $T$. Fifth, it makes use of the linear algebra functions of the LAPACK library to compute the eigenvalues and eigenvectors of the matrices.
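The tokenization stage can be approximated with a single regular expression. This is only a sketch of one possible convention (runs of letters, runs of digits, and single typographical signs); the pattern and the function name `tokenize` are assumptions, and the actual program may segment differently, for instance around apostrophes or hyphens.

```python
import re

# Letters (any script), then digit runs, then any single non-space symbol.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")

def tokenize(raw_text):
    """Split decoded text into word, figure and typographical-sign tokens."""
    return TOKEN_PATTERN.findall(raw_text)

tokens = tokenize("Quel malheur, quel grand malheur pour nous !")
mixed = tokenize("En 1697, l'été")
```

With this convention an elided form such as "l'été" yields three tokens (`l`, `'`, `été`); whether that is desirable depends on the languages at hand.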

The program's results are displayed in a way designed to give the user a visual representation of every word's soft membership to the clusters. For this purpose, we can represent each word with an RGB color, where each color level is associated to some principal axis $k$, and scales the component of $\tilde{y}_{i,k}$ for each word $i$. This allows the choice of three axes to compute the color. Let us assume we want to be able to display $\chi$ different color levels on each axis (in our illustrations, $\chi = 5$). For every selected component $k$, the $v$ different values for $\tilde{y}_{i,k}$ are grouped into $\chi$ connected intervals $I_1, I_2, \ldots, I_\chi$, not necessarily of the same length, such that $\cup_{i=1}^{\chi} I_i = [0, 1]$, and such that every interval contains the same number of points (approximately $v/5$): this yields the maximum visual contrast.
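The grouping into intervals holding equal numbers of points amounts to a rank-based (equal-frequency) binning. A minimal sketch follows, with a hypothetical function name and made-up values; ties are broken by sort order, which is an arbitrary choice.

```python
def color_levels(values, chi=5):
    """Assign each value a level in 0..chi-1 so that levels hold about len(values)/chi points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    levels = [0] * len(values)
    for rank, i in enumerate(order):
        levels[i] = rank * chi // len(values)   # rank-based bin, independent of spacing
    return levels

# Ten made-up membership values on one axis (illustrative data only).
values = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.4, 0.6, 0.8, 0.05]
levels = color_levels(values, chi=5)
# One way to turn a level into an 8-bit channel intensity.
channel = [level * 255 // 4 for level in levels]
```

Running the same function on three selected axes gives the three channel levels of a word's RGB color.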

Figure 2 presents such an experiment on a 1Mb text, containing four versions of the same tale (Cinderella, from the Grimm Brothers), in four languages: French, German, Spanish, and English. We have plotted both $\tilde{\mathbf{y}}_k$ (left column) and $\mathbf{y}_k$ (right column) for each word, removing punctuation marks from the spectral analysis (this explains why they are displayed in white).

Both columns show that the representations manage to cluster all languages. The right column also gives access to the sign of $\mathbf{y}_k$ (in this case $\cup_{i=1}^{\chi} I_i = [-1, 1]$), while values for the left column, $\Pr([v_i]_t \mid [\mathcal{V}_k]_t) = d_i y_{i,k}^2$, belong to a smaller interval, $[0, 1]$. In this context, it is quite interesting to see that the contrast between languages is marked on both columns. What is most interesting is that the contrast inside each language is actually sharper for $\Pr([v_i]_t \mid [\mathcal{V}_k]_t)$. While the colors distinguish the languages, they also "order" them in some sense. From the average color levels of each language, we can say that R(ed) is principally German, G(reen) is principally English, and B(lue) is principally Spanish. French is somewhere in between all of them. What is very interesting is that all this is in good accordance with the roots of these four languages, a fact which is of course utterly out of sight for the computer program.

An even more interesting experiment has consisted in trying the program on texts where languages are more intricately mixed. This is quite typically so in literature from multilingual regions, like in the case of the Creole-speaking communities mentioned in the introduction. The linguistic situation actually is reflected in the literature generated in those regions; as an example, we display (fig. 3) an extract of a bilingual novel from a Caribbean author, where segments in French and Creole alternate. In this case, rather than plotting $\Pr([v_i]_t \mid [\mathcal{V}_k]_t)$ for soft spectral clustering, we have plotted $\Pr([\mathcal{V}_k]_t \mid [v_i]_t) = \Pr([v_i]_t \mid [\mathcal{V}_k]_t)\Pr([\mathcal{V}_k]_t)/\Pr([v_i]_t)$, using $\pi_i = \Pr([v_i]_t)$, and solving the linear system $\mathbf{p}^* = \tilde{Y}^{-1}\pi$ to find $p^*_k = \Pr([\mathcal{V}_k]_t)$ ($k = 1, 2, \ldots, v$). Since we plot color levels for each word, it should be more convenient to yield sharper visual differences between languages. While both languages share many words, the results display quite surprising contrasts, and these are actually sharper when plotting $\Pr([\mathcal{V}_k]_t \mid [v_i]_t)$. In the crop of fig. 3, the program has even managed to extract a short French sentence (quel malheur, quel grand malheur pour nous) out of a Creole segment. Finally, Figure 4 presents a 2D plot on clusters $k = 2, 3$ of the forty most frequent words. It was interesting to notice that two soft clusters were enough to make a clear frontier appear between the two languages, though each side of this frontier obviously contains words that are found in both languages (a, an, la, ni, ou, tout, y, etc.).

Figure 3. Displaying word coordinates on an RGB space for an extract of a bilingual novel where French and French-based Creole fragments are intertwined.

Figure 4. The 40 most frequent words in a French-Creole Caribbean novel, projected on a plane along the second and third principal axes.

5 Conclusion

In this paper, we have provided a new way to build a Markov chain out of a text, which satisfies all conditions for a convenient spectral decomposition. We have provided a novel way to interpret the results of spectral decomposition, in terms of soft clustering. This probabilistic interpretation makes it possible to avoid the complexity gap that follows from traditional hard spectral clustering. This brings a natural approach to process a Markov chain, and make a soft clustering out of its state space. We bring an extensive comparison of hard and soft spectral clustering, along with some extended results on conventional spectral clustering.

The experiments clearly display the potential of such a method. It has the ability to separate two implicit Markov processes which have contributed in a mixed proportion to the generation of one single observable output. The results presented here are obtained from a simple bigram model; they can be improved, at the cost of some computation time, by taking into account variable length n-grams. The property of separating two implicit generation processes has an obvious application in language identification, and in language hierarchical clustering, as we have shown. Moreover, we believe that it also has a potential to prove useful, with some more research, in other types of applications like: identifying mixed discourse genres (e.g. formal vs. informal); identifying segments of text with different topics; or spotting sources within texts of mixed authorship.

In future works, we plan to drill down our soft spectral clustering results, to converge towards a complete probabilistic interpretation of spectral decomposition. Another target of our future research is using this method on other matrices computed from a text, like the matrix of distributional similarity (measuring "paradigmatic" syntactic distance instead of "syntagmatic" syntactic distance, cf. Introduction), with the aim of clustering syntactic and semantic categories in loosely-described languages.